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Next generation sequencing (NGS) instruments produce an unprecedented amount of 
sequence data at contained costs. This gives researchers the possibility of designing stud- 
ies with adequate power to identify rare variants at a fraction of the economic and labor 
resources required by individual Sanger sequencing. As of today, few research groups 
working in plant sciences have exploited this potentiality, showing that pooled NGS pro- 
vides results in excellent agreement with those obtained by individual Sanger sequencing. 
The aim of this review is to convey to the reader the general ideas underlying the use of 
pooled NGS for the identification of rare variants. To facilitate a thorough understanding 
of the possibilities of the method, we will explain in detail the possible experimental and 
analytical approaches and discuss their advantages and disadvantages. We will show that 
information on allele frequency obtained by pooled NGS can be used to accurately com- 
pute basic population genetics indexes such as allele frequency, nucleotide diversity, and 
Tajima's D. Finally, we will discuss applications and future perspectives of the multiplexed 
NGS approach. 
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INTRODUCTION 

Progress in genetics and genomics has the potential to greatly 
benefit plant sciences in many aspects, from crop improvement 
(Varshney et al., 2005) to forest trees management (Neale and Kre- 
mer, 201 1; Harfouche et al., 2012). One of the main advances that 
researchers are expecting is the elucidation of the genetic contri- 
bution to complex phenotypes. Despite the efforts devoted to this 
aim and the progress in knowledge that such efforts produced, 
the genetic contribution to complex phenotypes still remains 
largely unexplored, and a substantial proportion of the heritability 
of complex traits is not explained by identified polymorphisms. 
Several researchers addressed the problem of "missing heritability," 
and suggested several determinants of complex traits, including 
rare variants (Eichler etal., 2010). 

In the last decade, researchers focused on genetic association 
studies, according to the hypothesis that most of the genetic vari- 
ance was due to common variants with moderate effect (Lander, 
1996; Risch and Merikangas, 1996). Association studies allowed 
the identification of several polymorphisms, cumulatively explain- 
ing usually less than 50% of the heritability. This prompted 
researchers to acknowledge that the common variant/common 
disease hypothesis could not explain a large proportion of genetic 
variance, and raised again the interest for the study of rare variants 
(Bodmer and Bonilla, 2008). 

In plant science, the identification of variants affecting complex 
traits has practical implications for the development of breed- 
ing programs. Researchers working on experimental organisms 
learnt how to induce mutations in DNA using ionizing radiations 



even before the DNA structure became actually known (Muller, 
1927); this gave rise to the reverse genetic approach to study 
gene function. One of the best-known evolutions of traditional 
mutagenesis in plant science is TILLING, in which traditional 
chemical mutagenesis is followed by a high-throughput screening 
for point mutations (McCallum etal, 2000). TILLING has then 
been adapted to the discovery of polymorphisms in natural popu- 
lations and termed ecotilling (Comai et al., 2004). Ecotilling allows 
the effective screening of large samples and the identification of 
rare genetic variants at reduced costs, by reducing the amount 
of sequencing required. As an alternative, several researchers 
focused on individual Sanger sequencing of large samples, 
although with relatively large investments in labor and reagents 
(Marroni etal, 2011b). 

The advent of next generation sequencing (NGS), also referred 
to as deep sequencing and massively parallel sequencing, revolu- 
tionized the fields of genetics and genomics (Mardis, 2008). NGS 
enables researchers to obtain large amount of sequence data at 
relatively low cost so that a single research group can easily obtain 
the entire genome of a human individual (Wheeler etal., 2008). 

With a shift of perspective, the same amount of sequencing data 
can be used to obtain sequence information in a limited number of 
PCR amplicons for a large number of subjects (Ingman and Gyl- 
lensten, 2009; Out etal., 2009). Plant scientists just discovered the 
potential of this approach and deemed it as a potential evolution 
of TILLING and ecotilling (Marroni et al, 201 la; Tsai et al, 201 1). 
As of today, four studies have faced the problem of identifying 
variants in candidate regions in large plant populations via NGS 
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(Maughan etal., 2010; Kharabian-Masouleh etal., 2011; Marroni 
etal., 2011a; Tsai etal., 2011). We will review the four studies to 
give a perspective of the different options that are currently avail- 
able to perform the screening for rare variants, and emphasize 
the possible contribution of this approach to plant genetics and 
genomics in the next few years. 

PROLOG: THE ORIGINS OF POOLED MULTIPLEXED NGS 

Next generation sequencing is considered the ideal tool to perform 
follow-up sequencing of interesting regions identified by genome- 
wide association or linkage studies. This has been made possible 
by the use of DNA barcoding in NGS experiments (Parameswaran 
etal, 2007; Craig etal, 2008). DNA barcoding also referred to 
as multiplexing allows the sequencing of target regions in several 
subjects retaining the individual information. Commercial kits to 
perform multiplexed NGS experiments are now available, opening 
the way to several application studies. Plant scientists also took 
advantage of this technology to sequence plant chloroplasts in 
multiplex (Cronn etal., 2008). 

The first experiments of multiplex sequencing had limitations 
imposed by the relatively low throughput of the newly released 
next generation sequencers. As the throughput of next generation 
instruments increased, researchers became interested in applying 
NGS to the screening of candidate genes in large samples. This 
prompted several research groups to perform NGS experiments 
on pooled DNA (Ingman and Gyllensten, 2009; Out etal, 2009). 
Out and collaborators sequenced the candidate gene MUTYH in 
287 cancer patients and compared results of multiplexed NGS 
with individual Sanger sequencing. Using variants identified by 
Sanger as a gold standard they identified 2 false negatives, 4 false 
positives, and 15 true positives. In addition, they showed that the 
frequency of the variants identified by multiplexed NGS and indi- 
vidual Sanger sequencing were strongly correlated. Ingman and 
Gyllensten (2009) obtained a similar result. They used NGS to 
sequence the gene TSCOT in pooled DNA of 96 patients. They 
selected 16 identified polymorphisms for which Taqman assays 
were available. They then compared allele frequencies estimated 
by pooled NGS and by Taqman assays and observed a strong 
correlation. 

In summary, the above-mentioned studies provided the basis 
for multiplexed NGS (Parameswaran et al., 2007; Craig et al., 2008) 
and pooled NGS (Ingman and Gyllensten, 2009; Out et al, 2009). 
Plant scientists seeking to identify carriers of rare variants took 
inspiration from work published in human genetics. They com- 
bined multiplexing and pooling in a single experiment with the 
aim to minimize the amount of work needed for the discovery of 
rare variants and the identification of the carriers of such variants 
(Marroni et al, 201 la; Tsai et al, 201 1). 

THE QUEST FOR RARE VARIANTS 

Although there is no universal definition of rare variant, a fre- 
quency threshold of 1% is usually considered a reasonable bound- 
ary between rare and common variants (Bodmer and Bonilla, 
2008). The use of pooled NGS to identify rare variants requires 
the screening of a large number of individuals for mutations. 
According to the normal approximation to the binomial distri- 
bution, a sample size of about 150 chromosomes (75 individuals 



for diploid species) is required to have a 95% probability of sam- 
pling a variant with a 1% frequency. Studies explicitly aiming 
at the identification of rare variants (Marroni etal., 2011a; Tsai 
etal., 2011) require larger sample size than studies not focus- 
ing on rare variants (Maughan etal., 2010; Kharabian-Masouleh 
etal., 2011). Marroni and colleagues investigated 768 poplar 
(Populus nigra) accessions and Tsai and colleagues sequenced 768 
rice (Oryza sativa) accessions and 768 wheat (Triticum durum) 
accessions, Kharabian-Masouleh and colleagues sequenced 233 
rice accessions, and Maughan and colleagues sequenced 60 
Arabidopsis thaliana accessions. 

Given the above definition, a rare variant might be present 
in one or a few out of hundreds of sequenced chromosomes. 
Experiments focusing on the identification of rare variants should 
therefore be carefully designed to increase researchers' ability to 
discriminate mutations from sequencing errors or other kinds of 
bias. All the reviewed studies selected NGS of pooled libraries 
as the best option for SNP discovery in large samples. How- 
ever, each research group applied different variations to the study 
design according to the research question they were trying to 
answer. Different approaches have been used regarding (a) selec- 
tion of the target region, (b) DNA amplification, (c) pooling, 
and (d) multiplexing. We summarize in Table 1 the different 
modifications that each group brought to the basic experimental 
design. 

The reviewed studies used outcrossing species, such as P. nigra, 
and self-pollinating species, such as T. durum and O. sativa. From 
a technical point of view, identification of rare variants can be 
performed in the same way in self-pollinating and in outcrossing 
species. However, if the interest lies in loss-of-function variants 
likely to affect fitness, the study design might differ. Loss-of- 
function variants are less likely to occur in self-pollinating species, 
in which such mutations have a high probability to be observed 
in homozygous state. Studies aimed at the identification of loss- 
of-function variants in self-pollinating species are performed on 
mutant populations (Tsai et al., 201 1), while loss-of-function vari- 
ants have been successfully identified in natural populations of 
outcrossing species (Marroni et al., 201 la). 

SELECTION OF THE TARGET REGION 

Identification of rare variants is at present not feasible at a genome- 
wide scale; researchers need to enrich their libraries with DNA 
fragments originating from the genomic regions of interest. The 
length of the selected region has a crucial role in the choice of the 
enrichment method. Studies focusing on small region used PCR 
amplification (Marroni et al., 201 la; Tsai et al., 201 1) or long-range 
PCR (Kharabian-Masouleh etal., 2011). If the selected regions 
are larger than a few hundred Kb, then PCR is not the optimal 
method. One possible alternative is the "reduced representation 
libraries" or "genome reduction" approach, consisting in the use 
of restriction endonucleases to randomly select a given number of 
genomic regions (Maughan etal., 2010). 

PCR amplification is better suited when the interest is focused 
on a limited number of candidate genes to be screened in a very 
large number of accessions. Marroni and colleagues used PCR 
amplification to select candidate genes spanning less than 50 Kb 
of the genome, and sequenced them in 768 accessions. Genome 
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Table 1 | List of different procedures applied to the basic pooled NGS approach with the respective advantages and disadvantages. 



Procedure 


Advantage 


Disadvantage 


Reference 


Amplification of individual DNA 


Increases accuracy 


Requires a large number of PCRs 


Marroni etal. (2011a) 


Select target via PCR 


Precisely select a small region. Very 


Difficult to perform PCR on large 


Kharabian-Masouleh etal. (2011), 




large samples can be sequenced 


regions 


Marroni etal. (2011a), Tsai etal. (2011) 


Select target via genome reduction 


Selection of large genomic regions 


It is not easy to sequence large 


Maughan etal. (2010) 






regions in very large samples 




Use of multiplexing (barcoding) 


Facilitates follow-up. Facilitates 


Increases the number of libraries 


Marroni etal. (2011a), 




detection of rare variants 


to be prepared 


Maughan etal. (2010), Tsai etal. (2011) 


Multidimensional pooling 


Reduces (or eliminates) need of 


Increases the number of libraries 


Tsai etal. (2011) 




follow-up sequencing 


to be prepared 




Use of high-fidelity polymerase 


Decreases error rate, increases 


Increases reagents cost per 


Marroni etal. (2011a) 




accuracy 


reaction 





reduction is better suited to survey relatively large genomic regions 
in a more limited number of accessions. Maughan and collab- 
orators sequenced a target region spanning about 4 Mb in 60 
individuals using Roche GS-FLX 454 and a region spanning about 
700 Kb using Alumina GA II. 

POOLING STRATEGIES, TARGET ENRICHMENT, AND DNA 
AMPLIFICATION 

Illumina NGS instruments allow the generation of a large amount 
of sequence data in a single run. Each run is divided into 8 or 
16 independent lanes of a flow cell. At the time the studies were 
performed, each lane could generate "only" one billion base pairs. 
Sequencing of a region spanning 50 Kb in a single individual in 
a sequencing lane would have resulted in an average coverage of 
20,000-fold. Sequencing pooled DNA of 20 individuals in a single 
lane would have resulted in a 1000-fold individual coverage. In 
order to save sequencing resources, sequencing of pooled samples 
in a single lane was required (Maughan etal., 2010; Kharabian- 
Masouleh etal, 2011; Marroni etal, 2011a; Tsai etal, 2011). The 
need for analyzing multiple samples in a single lane is even stronger 
today, given that the new next generation sequencers can easily 
generate more than 10 Gb per lane. 

The approaches chosen to design the pooling strategies vary. 
Some researchers minimized the number of PCRs, and ampli- 
fied pooled DNA (Kharabian-Masouleh etal., 2011; Tsai etal., 
2011). However, DNA samples from different individuals may 
not be amplified with similar efficiency in PCRs, creating random 
biases (Kharabian-Masouleh et al., 201 1). To avoid PCR bias, other 
researchers performed individual PCRs with the aim of obtaining 
less bias in the abundance of PCR products for each individual 
(Marroni etal., 2011a). Amplification of pooled DNA reduces 
costs, while individual amplification aims at increasing accuracy. 
Finally, when the population size is not prohibitively large, it is 
possible to uniquely tag each accession with a barcode (Maughan 
etal., 2010); we refer to this approach as multiplex sequencing in 
contrast to pooled multiplex sequencing, where each barcode tags 
a pool of individuals. 

In pooled multiplex sequencing, the size of each pool is a criti- 
cal issue. Having a large number of accessions in each pool allows 



saving sequencing resources and minimizes efforts for library 
preparation. On the other hand, having large pools makes identi- 
fication of singleton variants difficult, because of the decrease in 
signal-to-noise ratio. In addition, many studies aim at the follow- 
up and phenotypic characterization of subjects carrying a given 
mutation. In such cases, all the accessions composing the pool 
have to undergo individual sequencing. The larger the pool size, 
the larger the number of subjects that needs to be sequenced for 
follow-up (Figure 1). The reviewed studies chose a pool size of 
64-96 subjects, corresponding to 128-192 chromosomes. Thus, a 
mutation occurring only once in the pool would have a frequency 
of 0.5-0.8%, which can be distinguished from sequencing and 
amplification errors (Marroni et al., 201 la; Tsai et al., 201 1). 

A final option in pooling is multidimensionality. The plain 
(unidimensional) pooling strategy implies that the n accessions to 
be sequenced are divided into p pools, each including 5 accessions, 
where n = px s. Multidimensional pooling consists in performing 
more than one pooling layer. Tsai and colleagues performed a 
bidimensional pooling by deploying 8 accessions in each of the 96 
wells of a plate, and then created 8 row pools (64 subjects each) 
and 12 column pools (96 subjects each). Each accession was thus 
sequenced twice; the total number of required libraries was 20. 
The increase in library preparation effort was compensated by the 
advantage that follow-up of a variant appearing only once in the 
whole sample required to sequence only eight subjects. Tsai and 
colleagues also performed three-dimensional pooling (Figure 2). 
Accessions were deployed in 12 plates of 64 wells (one accession per 
plate) . Pooling was performed by ( 1 ) pooling all the accessions in a 
plate, thus creating 12 pools of size 64; (2) pooling over columns, 
creating 16 pools of 48 individuals each; and (3) pooling over 
rows, creating 16 pools of 48 individuals each. The total number 
of libraries required by this pooling scheme was 44. The increase 
in the number of required libraries was balanced by the fact that 
no follow-up sequencing was needed, since each accession was 
univocally defined by its coordinates in the three pooling schemes. 

MULTIPLEXING 

Once PCR has been performed for each selected amplicon, PCR 
products of the different amplicons are pooled in an equimolar 
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Whole sample 

(768 subjects) 



♦ Common variant 
• Rare variants 

■ Indexl 

■ Index2 
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Index12 




Pool 1 

(64 subjects) 



Pool 2 

(64 subjects) 



Pool 3 

(64 subjects) 



Pool 12 

(64 subjects) 



FIGURE 1 | The pooling strategy used by Marroni etal. (2011a). 



12 D-pools of 64 



16 column pools 
of 48 





16 row pools of 48 



1 m 



12 plates of 64 individuals = 768 



FIGURE 2 | The three-dimensional pooling strategy used byTsai etal. 
(2011, reprinted with permission). 



amount for library preparation. Commercially available kits allow 
tagging of each library with a short DNA sequence as a barcode 
to perform multiplexed experiments. This enables the sequencing 
of several libraries in a single lane, thus reducing the number of 



lanes without losing ability of discriminating between libraries. 
Some authors used this feature in combination with pooling 
(Marroni etal., 2011a; Tsai etal., 2011). The use of multiplex- 
ing increases the number of libraries that need to be prepared, 
but facilitates the identification of carriers of an interesting vari- 
ant. In Figure 1, we show how a sample of 768 accessions can 
be divided into 12 pools of 64 subjects each, by preparing 12 
libraries and tagging each library with a unique DNA barcode 
(index). When a variant of interest is identified, researchers often 
need to identify the carrier in order to analyze the effects of 
the variant on the phenotype. The use of multiplexing dras- 
tically reduces the amount of follow-up sequencing needed to 
identify the mutation carrier. The combined use of multidimen- 
sional pooling and multiplexing results in a greater number of 
libraries, but also in a further reduction of the required follow-up 
sequencing. 

Tsai and collaborators, using three-dimensional pooling and 
multiplexing, were able to identify the carriers of a given mutation 
without the need of performing additional individual sequencing. 

In one of the reviewed studies, there was no interest in the exact 
identification of the carriers of interesting variants and researchers 
performed pooled sequencing without the use of multiplexing 
(Kharabian-Masouleh et al, 201 1). 
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When the number of individuals is limited, it is possible to 
construct a library for each individual and use a DNA barcode 
to uniquely tag each sequenced accession. Maughan and col- 
leagues used 32 barcodes to tag a total of 60 A. thaliana accessions 
sequenced in two sequencing lanes. This required the preparation 
of 60 different libraries, but allowed the exact identification of 
each accession. 

SEQUENCING DEPTH 

The number of sequencing reads produced in NGS experiments 
is impressive and constantly increasing. Kharabian-Masouleh and 
collaborators generated a total of almost 4 Gb of sequence, while 
Marroni and colleagues generated about 2.5 Gb. Maughan and 
colleagues generated a total of 1 Gb using 454 and 1 .5 Gb using Illu- 
mina GAIL No explicit information on sequencing data is available 
for the work of Tsai and colleagues, but indirect evidence suggests 
that they generated a comparable amount of sequence data. 

In all the studies, the amount of generated reads resulted in a 
very high coverage (the number of times any position of the target 
region is sequenced) . In each pool, coverage can be calculated as 
the total length of generated sequence L, divided by the length of 
the target region, /. In pooled NGS experiments, an important 
measure is the Mean Individual Coverage (MIC), expressing the 
average number of times that each accession is sequenced. MIC 
is given by LI (I x n), where n is the number of accessions in 
the pool. 

It is often useful to know the value of MIC at each base position, 
also referred to as Mean Individual Base Coverage (MIBC). MIBC 
is calculated as R/n where R is the number of reads covering the 
considered base. MIBC is essential to filter positions with exces- 
sively high or low coverage, while MIC is useful to give a general 
idea of the coverage of the experiment. 

Pooled experiments aiming at the identification of rare variants 
require a high MIC. In bidimensional pooling experiments, Tsai 
and colleagues aimed at obtaining an MIC of about 75 x (30 x per 
row and 45 x per column). Unidimensional pooling requires even 
higher coverage; Kharabian-Masouleh and colleagues reported an 
MIC ranging 60-172 x , while Marroni and colleagues reported an 
MIC ranging 100-350 x, depending on the considered amplicon. 
Maughan and colleagues obtained an MIC of 4.3 x with 454 data 
and 67 x for Illumina GAII data (Table 2). 



Table 2 | Summary statistics of reads and coverage obtained in three 
NGS studies. 



/.(Mb) 


/-(bp) 


n 


MIC 


Reference 


3928 


109,067 


233 


155 


Kharabian-Masouleh etal. (2011) 


900 


2491 


768 


470 


Marroni etal. (2011a)* 


1650 


11,500 


768 


187 


Marroni etal. (2011a)** 


1052 


4,063,602 


60 


4.3 


Maughan etal. (2010)*** 


1470 


688,753 


32 


67 


Maughan etal. (2010)**** 



L, total length of generated sequence; I, total length of reference sequence; n, 
number of accessions; MIC, mean individual coverage, calculated as L/(l x n). 
* Phase 1 of the experiment; ** Phase 2 of the experiment; ***Sequencing 
performed using Roche GS-FLX454; ****Sequencing performed Illumina GA II. 



Three of the reviewed studies (Maughan etal., 2010; 
Kharabian-Masouleh etal., 2011; Marroni etal., 2011a) provided 
sufficient details to investigate how MIC, calculated as LI (I x n), 
is affected by the variation of (1) the total amount of generated 
sequence (L); (2) the length of the reference "genome" (or tar- 
geted region, /); and (3) the number of subjects present in the 
pool (n). 

The differences in MIC between studies in spite of the simi- 
lar amount of generated sequence reflect different experimental 
designs (Table 2). The two extremes are the works of Marroni and 
colleagues and that of Maughan and colleagues. The first explicitly 
aimed at the identification of rare variants in a very large number 
of accessions (large n) and focused on a small region (small /). The 
work of Maughan and colleagues aimed at discovering (common) 
variants in a large proportion of the genome (very large /) in a 
moderately sized sample (small n). 

DEALING WITH SEQUENCING ERRORS 

The search for rare variants is hampered by errors in the sequenc- 
ing process. Errors might be introduced by inaccuracy of the 
DNA polymerase during amplification or by mistakes in the iden- 
tification of the incorporated nucleotide in the sequencing by 
synthesis process used by most NGS instruments. Most researchers 
decide to use high-fidelity polymerases to reduce the error rate of 
the amplification step (Kharabian-Masouleh etal., 2011; Marroni 
etal., 2011a). In Section "Variants Identification and Validation," 
we discuss further the improvement due to the use of a high- 
fidelity polymerase according to the results reported by Marroni 
and colleagues. 

Another source of errors is the sequencing by synthesis process 
itself. Such error is low but not negligible and is reported to be in 
the range of 0.5— 1% (Kharabian-Masouleh etal., 2011). In multi- 
plexed experiments, errors in the sequencing step will affect both 
the sequencing reads, potentially causing false-positive findings, 
and the DNA barcode (index), causing the loss of reads for which 
the barcode cannot be unambiguously determined. Marroni and 
colleagues reported that about 4% of the reads contained at least 
one error in the barcode. 

BI0INF0RMATICS 

In NGS experiments involving one single individual, researchers 
can record three genotypic states: (1) the subject does not carry 
the variant (no sequencing read carry the variant); (2) the subject 
is heterozygous for the variant (roughly 50% of the reads carry the 
variant); and (3) the subject is homozygous carrier of the variant 
(all the reads carry the variant). These thresholds, with reason- 
able approximations and/or with the aid of statistical models, are 
used to determine genotype of individuals in NGS experiments. 
The basic assumption of the approach is that the number of reads 
carrying a variant is roughly proportional to the number of chro- 
mosomes carrying the same variant. It is possible to apply the 
same concept to experiments involving the sequencing of sev- 
eral individuals by assuming that the frequency of the variant in 
the population is roughly proportional to the fraction of reads 
carrying the variant itself. 

However, the identification of variants in pooled samples poses 
additional challenges. In particular, it requires optimizing SNP 
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detection, so that rare variants can be discriminated from errors. 
This might require adapting existing algorithms for variant calling 
to pooled samples (Marroni et al, 201 la) or developing completely 
novel algorithms (Tsai et al, 201 1), although some researchers pre- 
fer to rely on the use of commercially available software packages 
(Kharabian-Masouleh et al., 2011). 

The first step in the detection of variants in pooled NGS exper- 
iments is the alignment to a reference sequence. Several software 
packages are available to perform this task. Kharabian-Masouleh 
and colleagues used a commercial software package, CLCbio 1 , 
Marroni and colleagues used a free version of the commercial 
aligner Novoalign 2 and Tsai and colleagues used the freely avail- 
able BWA aligner (Li and Durbin, 2009). The second step is the 
variant calling itself. Again, researchers may use several available 
tools. Kharabian-Masouleh and colleagues relied on CLCbio, Mar- 
roni and coworkers used Varscan (Koboldt et al., 2009), while Tsai 
and colleagues used CAMBa (Missirian et al., 201 1). 

To obtain reliable results, all authors imposed limitations 
regarding base coverage and the proportion of reads carrying 
the variant. Bases with coverage below a defined threshold are 
discarded because they do not allow confident assignment of geno- 
type. In addition, variants occurring with low frequency cannot 
be confidently distinguished from errors. Such step is essential 
to perform variant calling in pooled samples, and can be easily 
performed with commercially available software (Kharabian- 
Masouleh et al, 201 1). However, several authors acknowledge that 
additional improvements are needed. Such improvements often 
require to develop custom algorithms or at least to perform custom 
operations on sequencing data. Tsai and colleagues used CAMBa 
(Missirian etal., 2011), a pipeline optimized to work with mul- 
tidimensional pooling strategies. It takes into consideration the 
pooling depth, the expected frequency and type of mutations, 
the probability of heterozygosis, the sequencing quality of the 
call, and the frequency of forward and reverse reads. Marroni 
and colleagues used additional filtering steps to improve variant 
calling accuracy. Relying on the observation that false-positive 
SNPs can arise at the b' and 3 r ends of Illumina reads due to the 
higher error rate of the sequencing process or the presence of small 
insertions/deletions near the read ends, they introduced two addi- 
tional filtering steps. They (1) analyzed the forward and reverse 
strands separately and required that each SNP was identified in 
both strands and (2) partitioned each read into three segments of 
equal length, and required that each SNP was identified in each of 
the segments of the read. 

VARIANTS IDENTIFICATION AND VALIDATION 

Two of the reviewed studies compared variants identified by 
pooled multiplexed NGS with those identified in the same acces- 
sions by individual screening with an established protocol used as a 
gold standard ( Marroni et al., 2011a; Tsai et al, 20 1 1 ) . Both studies 
showed excellent agreement between pooled NGS and individual 
Sanger sequencing. Tsai and colleagues tested in total 84 variants 
passing an ad hoc probability threshold. Of them, only one was 
identified as a false positive. Marroni and colleagues tested 137 
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variants passing an ad hoc frequency threshold, and six of them 
were false positives. 

Validation of the identified variants showed that pooled NGS 
reliably identified variants with frequency lower than 1%. Mar- 
roni and coworkers performed an ROC analysis comparing pooled 
NGS and individual Sanger sequencing on a randomly chosen sub- 
set, and used this analysis to choose the variant frequency threshold 
maximizing true positives while minimizing false positives. Their 
results showed that the optimal frequency threshold was around 
0.4%. Tsai and colleagues also identified variants with frequen- 
cies below the 1% threshold. Kharabian-Masouleh and colleagues 
reported for the Illumina GA II sequencer an error rate in the 
range 0.5-1%. The studies from Marroni and Tsai showed that 
the achievement of very high MIC and the development of spe- 
cific algorithms for variant calling allowed to overcome this limit 
and to effectively identify rare variants. An additional experiment 
performed by Marroni and coworkers showed the effect of poly- 
merase accuracy on the error rate of the sequencing process. The 
authors used a high-fidelity polymerase to amplify one of the stud- 
ied amplicons, while the remaining amplicons were amplified by a 
standard polymerase (Marroni etal., 2011a). To assess differences 
due to the use of different polymerases, they measured the pro- 
portion of polymorphic positions in each amplicon as a function 
of the frequency threshold applied for variant calling (Figure 3). 
At very low frequency thresholds, the proportion of bases carry- 
ing a putative SNP was between 0.9 and 1.0. Given that the actual 
proportion of polymorphic positions in the studied genes was less 
than 5%, most of such putative SNPs were false positives. Increas- 
ing the frequency threshold for variant calling caused a decrease 
in the proportion of false positives. These results showed that the 
use of a high-fidelity polymerase sensibly decreased false posi- 
tives, thus suggesting that a significant amount of errors in NGS 
of pooled amplicons were due to DNA polymerase errors. In con- 
clusion, the use of a high-fidelity polymerase can sensibly decrease 
the error rate and the optimal variant frequency threshold, and 
increase sensitivity and specificity of the method. However, at 
relatively high frequency thresholds, the performance of the stan- 
dard polymerase is comparable to that of high-fidelity polymerase. 
Thus, the authors suggested the use of high-fidelity polymerase for 
amplification of the target genes only when rare variants need to 
be identified. The observation that DNA polymerase errors are 
responsible for the identification of potential false positives have 
been put forward also by Tsai and colleagues. In their work, DNA 
polymerase errors gave rise to "orphan" mutations, i.e., mutations 
that in bidimensional or three-dimensional pooling were identi- 
fied only in one pool. Tsai and colleagues were able to decrease 
the occurrence of orphans by increasing the amount of DNA 
template. 

A number of studies in human and plant genetics suggest that 
pooled multiplexed NGS gives reliable estimates of SNP allele 
frequency (Ingman and Gyllensten, 2009; Out etal., 2009; Mar- 
roni etal, 2011a). Several population genetics parameters can be 
estimated using information on allele frequency and sample size. 
One of such parameters is nucleotide diversity, that can be mea- 
sured as the sum of the unbiased heterozygosity of segregating sites 
(Tajima, 1989), averaged over all nucleotides. The work by Mar- 
roni and colleagues showed that pooled multiplexed NGS gives 
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FIGURE 3 | Number of called SNP per nucleotide as a function of the 
frequency set as the threshold for the five genes studied in the 
experiment. The dots represent the frequency threshold at which the 
proportion of bases identified as polymorphic is 50% of the maximum 
value. Blue lines represent amplicons sequenced using a standard 
polymerase. Red line represents amplicons sequenced using a high-fidelity 
polymerase. 
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FIGURE 4 | Nucleotide diversity calculated using allele frequency 
estimated by pooled multiplexed NGS (red bars) and by individual Sanger 
sequencing (blue bars). 



reliable estimates of nucleotide diversity, showing strong correla- 
tion with estimates obtained with individual Sanger sequencing 
(Figure 4). Another parameter that can be estimated using pooled 
multiplexed NGS is Tajima's D, a statistic used to test for departure 
from neutral evolution (Tajima, 1989). Negative values of D indi- 
cate negative selection, while positive values of D are suggestive of 
balancing selection. 

FUTURE PERSPECTIVES 

With the ever increasing sequence output of next generation 
sequencers, it is becoming feasible to sequence megabases of tar- 
get regions at high coverage in a large number of subjects. In such 
an experiment, the PCR amplification would become the bottle- 
neck (Mamanova etal., 2010). An alternative approach is the use 
of hybridization-based target enrichment methods (Gnirke etal., 
2009), which can target regions from few to hundreds of megabases 
(the size of human exome). However, commercial products based 
on hybridization can sensibly increase sequencing costs. An alter- 
native method is the use of a set of restriction enzymes to fragment 
DNA, followed by targeted circularization and amplification of 
the target regions (Dahl etal., 2007). The commercial application 
of the latter approach is less expensive than hybridization-based 
applications and can target regions from less than 1 Mb to hun- 
dreds of megabases, but is at present only available for human 
genome. Adapting the technology to other organisms is feasi- 
ble, but it would require the optimization of a different set of 
restriction enzymes for DNA fragmentation. A similar approach 
has already been implemented to genotype an A. thaliana map- 
ping population (Maughan etal, 2010). The work by Maughan 



and coworkers confirmed that the construction of reduced repre- 
sentation libraries is a valid and cost-effective approach to target 
selection in plants. 

Another interesting option is the use of hybridization-based 
target enrichment methods on pooled multiplexed samples. The 
advantage of this approach would be that only one hybridization 
experiment is needed to perform target enrichment in several sub- 
jects. This approach has already been successfully applied for the 
identification of mutations in humans (Cummings etal., 2010; 
Kenny etal., 2011) and in rats (Nijman etal, 2010). The possi- 
bility of performing the hybridization step on pooled multiplexed 
samples can sensibly reduce costs and labor and is an interesting 
opportunity for plant genomics experiments. 

Researchers in human genetics can target the whole human 
genome by exon capture (Hodges etal., 2007). Several compa- 
nies offer commercial kits to perform whole exome enrichment in 
human subjects. This option is extremely appealing since allows 
to obtain sequence information on the entire coding portion of 
the human genome at a reduced cost. The cost of sequencing a 
human exome is already well below 1000$ if only sequencing costs 
are considered (Mertes etal, 2011). 

As of today, the saving in sequencing costs obtained using target 
enrichment is greater than the increase in costs of library prepara- 
tion due to enrichment protocols (Mamanova etal, 2010). Given 
the steady decrease of sequencing costs, it is possible that in the 
future whole-genome sequencing of large samples will be more 
efficient than sequencing targeted regions. However, at present, 
only large research consortia can undertake whole-genome rese- 
quencing in sample sizes larger than 100 subjects. This is the case of 
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the 1000 Genomes Project, in which nearly 200 subjects underwent 
low-coverage (<6x) genome sequencing and nearly 700 subjects 
underwent high coverage sequencing of slightly less than 1000 
genes (The 1000 Genomes Project Consortium, 2010), and of the 
1001 Genomes, a project aiming at surveying genetic variation in 
A. thaliana (Weigel and Mott, 2009), which produced so far com- 
plete genome sequence of several A. thaliana strains. Although we 
hypothesize that a long-term aim will be to obtain high coverage 
sequence of whole genome of large samples, we foresee that in the 
near future the search for rare variants will still be driven by the 
use of targeted resequencing. 

CONCLUSION 

Pooled NGS of target regions is an efficient method to screen 
large populations for polymorphisms. It has been applied with 
excellent results to mainly self- pollinating crops such as wheat 
and rice (Kharabian-Masouleh et al., 201 1; Tsai et al, 201 1) and to 
outcrossing trees (Marroni etal., 2011a). 

The main interest of pooled NGS of target regions is the 
identification of variants with a functional role (Marroni etal., 
2011a; Tsai etal., 2011). Marroni and colleagues identified a 
mutation causing a premature stop codon in HCT1, a gene 
involved in lignin production. Phenotype of mutation carriers is 
now being thoroughly investigated, and experimental crosses are 
being carried out to extensively evaluate the potential impact of 
the mutation. If the identified mutation has a functional role, 
then it will be of importance for improving poplar breeding 
programs. 

The first studies using multiplexed pooled NGS were performed 
in humans (Ingman and Gyllensten, 2009; Out etal., 2009) and 
reported a good correlation between the proportion of reads carry- 
ing a polymorphism and the polymorphism frequency. The ability 



to accurately estimate allele frequency enables researchers to accu- 
rately estimate several population genetics parameters, such as 
nucleotide diversity and Tajimas D. One of the reviewed stud- 
ies used pooled multiplexed NGS to obtain accurate estimates of 
nucleotide diversity and Tajimas D (Marroni etal., 2011a). Com- 
parison of nucleotide diversity estimated by pooled multiplexed 
NGS with estimates obtained by individual Sanger sequenc- 
ing showed a strong agreement between the two approaches 
(Figure 4). 

Thus, pooled multiplexed NGS can be successfully used to accu- 
rately estimate population genetics parameters. The approach is 
also naturally suited for projects investigating the genetic struc- 
ture of different populations. In projects investigating the genetic 
background of two or more populations, pooling of individuals 
of each population would allow saving resources without loss of 
information. 

Finally, the ability to contrast allele frequencies between 
two samples will also greatly facilitate pooled association map- 
ping, a method already proposed to identify association between 
polymorphisms and qualitative traits (Sham etal., 2002). 

We foresee that the increase of next generation sequencers 
throughput, the availability of cost-effective targeting methods 
and the improvement of multiplexing capabilities will lead to a 
further improvement of performance, making it feasible to geno- 
type thousands of variants in large numbers of individuals in a 
single sequencing run retaining individual information. 
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